Random simulations of a datatable for efficiently mining reliable and non-redundant itemsets

Authors

  • Martine Cadot
  • Pascal Cuxac
  • Alain Lelu
Abstract

Our goal is twofold: 1) we want to mine only the statistically valid 2-itemsets from a boolean datatable; 2) on this basis, we want to build only those higher-order itemsets that are non-redundant with respect to their sub-itemsets. For the first task we have designed a randomization test (Tournebool) that respects the structure of the data variables and is independent of the specific distributions of the data. In our test set (959 texts and 8,477 terms), this reduces the number of 2-itemsets from 126,000 to 13,000 significant ones at the 99% confidence level. For the second task, we have devised a hierarchical stepwise procedure (MIDOVA) for evaluating the residual amount of variation devoted to higher-order itemsets, yielding new possible positive or negative high-order relations. On our example, this leads to counts ranging from 7,712 for 2-itemsets down to 3 for 6-itemsets, with no higher-order ones, in a computationally efficient way.
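The abstract does not say how Tournebool generates its simulated datatables. As a rough illustration of what a randomization test on a boolean table can look like, the Python sketch below assumes a standard margin-preserving swap randomization: each simulated table keeps the row and column totals of the original, and the observed co-occurrence count of a pair of items is compared with its counts in the simulations. This is an assumption made for illustration, not necessarily the authors' procedure, and the names `swap_randomize`, `support` and `pair_p_value` are hypothetical.

```python
import random


def swap_randomize(matrix, n_swaps, rng):
    """Return a shuffled copy of a 0/1 matrix with unchanged row and column sums.

    Assumption: margin-preserving 2x2 swaps stand in for the simulation step;
    the abstract does not describe the actual Tournebool mechanism.
    """
    m = [row[:] for row in matrix]  # work on a copy, leave the input intact
    n_rows, n_cols = len(m), len(m[0])
    for _ in range(n_swaps):
        r1, r2 = rng.sample(range(n_rows), 2)
        c1, c2 = rng.sample(range(n_cols), 2)
        # A "checkerboard" 2x2 submatrix (1 0 / 0 1 or 0 1 / 1 0) can be
        # flipped without changing any row or column total.
        if (m[r1][c1] == m[r2][c2] and m[r1][c2] == m[r2][c1]
                and m[r1][c1] != m[r1][c2]):
            m[r1][c1], m[r1][c2] = m[r1][c2], m[r1][c1]
            m[r2][c1], m[r2][c2] = m[r2][c2], m[r2][c1]
    return m


def support(matrix, a, b):
    """Count the rows in which items a and b are both present."""
    return sum(1 for row in matrix if row[a] and row[b])


def pair_p_value(matrix, a, b, n_sim=200, n_swaps=500, seed=0):
    """One-sided empirical p-value for a positive association between a and b."""
    rng = random.Random(seed)
    observed = support(matrix, a, b)
    hits = 0
    for _ in range(n_sim):
        sim = swap_randomize(matrix, n_swaps, rng)
        if support(sim, a, b) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_sim + 1)


if __name__ == "__main__":
    # Toy table: rows are texts, columns are terms (made-up data).
    data = [
        [1, 1, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 1, 1],
        [0, 1, 1, 0],
        [1, 0, 0, 1],
    ]
    print(pair_p_value(data, 0, 1))
```

Under these assumptions, a 2-itemset would be kept at the 99% level when its empirical p-value is at most 0.01, which requires at least 99 simulations with the add-one correction used here.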

Similar resources

Data sanitization in association rule mining based on impact factor

Data sanitization is a process that is used to promote the sharing of transactional databases among organizations and businesses; it alleviates concerns for individuals and organizations regarding the disclosure of sensitive patterns. It transforms the source database into a released database so that counterparts cannot discover the sensitive patterns and so data confidentiality is preserved ag...

Mining Non-Redundant Frequent Pattern in Taxonomy Datasets using Concept Lattices

In general, frequent itemsets are generated from large data sets by applying various association rule mining algorithms; these produce many redundant frequent itemsets. In this paper we propose a new framework for non-redundant frequent itemset generation using closed frequent itemsets, without loss of information, on taxonomy datasets using concept lattices. General Terms Frequent Pattern, Assoc...

An Efficient Three-Scan Approach for Mining High Utility Itemsets

Utility mining finds out high utility itemsets by considering both the profits and quantities of items in transactions. In this paper, a three-scan mining approach is proposed to efficiently discover high utility itemsets from transaction databases. The proposed approach utilizes an itemset-generation mechanism to prune redundant candidates early and to systematically check the itemsets from tr...

Reliable representations for association rules

Association rule mining has contributed to many advances in the area of knowledge discovery. However, the quality of the discovered association rules is a big concern and has drawn more and more attention recently. One problem with the quality of the discovered association rules is the huge size of the extracted rule set. Often for a dataset, a huge number of rules can be extracted, but many of...

Using attribute value lattice to find closed frequent itemsets

Finding all closed frequent itemsets is a key step of association rule mining since the non-redundant association rule can be inferred from all the closed frequent itemsets. In this paper we present a new method for finding closed frequent itemsets based on attribute value lattice. In the new method, we argue that vertical data representation and attribute value lattice can find all closed freq...


Publication date: 2007